A machine learning problem in which the learner must predict a discrete outcome is a classification problem. Here the objective is to identify students who might need an early intervention, which in machine learning terms is a binary classification problem. More concretely, given the data that describes a student (the features), the learner must classify the student as "Passed" or "Failed", so that the prediction can aid in deciding whether a student needs an early intervention or not.
In [1]:
# Import libraries
import numpy as np
import pandas as pd
from IPython.display import display, HTML
from collections import defaultdict
In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label, all other are feature columns
Now, can you find out the following facts about the dataset: the total number of students, how many passed and failed, the number of features, and the graduation rate?
Use the code block below to compute these values. Instructions/steps are marked using TODOs.
In [3]:
"""
TODO: Compute desired values - replace each '?'
with an appropriate expression/function call
"""
n_students = len(student_data.index)
n_features = len(student_data.columns)-1
n_passed = len(student_data[student_data.passed=="yes"].index)
n_failed = n_students-n_passed
grad_rate = float(n_passed)/n_students * 100
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)
In this section, we will prepare the data for modeling, training and testing.
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.
Let's first separate our data into feature and target columns, and see if any features are non-numeric.
Note: For this dataset, the last column ('passed') is the target or label we are trying to predict.
In [4]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1]) # all columns but last are features
target_col = student_data.columns[-1] # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print
print "Target column: {}".format(target_col)
In [5]:
X_all = student_data[feature_cols] # feature values for all students
y_all = student_data[target_col] # corresponding targets/labels
print "\nFeature values:-"
print X_all.head() # print the first 5 rows
As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.
Other columns, like Mjob and Fjob, have more than two values and are known as categorical variables. The recommended way to handle such a column is to create as many columns as there are possible values (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and assign a 1 to one of them and 0 to all others.
These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation.
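As a quick illustration (with a hypothetical three-row Fjob column), get_dummies expands one categorical column into one indicator column per value:
example = pd.Series(['teacher', 'other', 'services'], name='Fjob')
print pd.get_dummies(example, prefix='Fjob')
# Produces columns Fjob_other, Fjob_services, Fjob_teacher,
# with a single 1 per row marking that row's original value.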
In [6]:
# Preprocess feature columns
def preprocess_features(X):
    """
    Function: Change the input categorical features into binary features
              using the pandas get_dummies function.
    Params:
        X: A pandas DataFrame containing the categorical features.
    Returns:
        outX: A pandas DataFrame with dummy variables created
              in place of categorical features.
    """
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty
    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int
        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'
        outX = outX.join(col_data)  # collect column(s) in output dataframe
    return outX
In [7]:
X_all = preprocess_features(X_all) #pre-process features
y_all = y_all.replace(['yes','no'],[1,0]) #pre-process target labels
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))
In [8]:
from sklearn.cross_validation import train_test_split
In [9]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0] # same as len(student_data)
num_train = 300 # about 75% of the data
num_test = num_all - num_train
# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
X_train,X_test,y_train,y_test = train_test_split(X_all,y_all,test_size=.24,random_state=42)
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data
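If a validation set were needed, one minimal sketch is to split it out of the training data with the same helper (the 0.25 fraction here is an arbitrary, illustrative choice):
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print "Train: {}, Validation: {}".format(X_tr.shape[0], X_val.shape[0])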
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:
Produce a table showing training time, prediction time, F1 score on training set and F1 score on test set, for each training set size.
Note: You need to produce 3 such tables - one for each model.
In [10]:
# Train a model
import time
def train_classifier(clf, X_train, y_train):
    """
    Function: Given the instance of a classifier and the training samples
              and target labels, return the trained classifier and the
              time taken to train it.
    Params:
        clf: Instance of a scikit-learn classifier.
        X_train: The training data to fit the classifier instance with,
                 {array-like, sparse matrix}, shape (n_samples, n_features).
        y_train: The target labels to train the classifier instance with,
                 array-like, shape (n_samples,).
    Returns:
        clf: The classifier trained using the training samples and target labels.
        end-start: Time taken (secs) to train the classifier.
    """
    #print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    #print "Done!\nTraining time (secs): {:.3f}".format(end - start)
    return clf, end - start
In [11]:
# TODO: Choose a model, import it and instantiate an object
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
# Fit model to training data
clf,time_taken = train_classifier(clf, X_train, y_train) # note: using entire training set here
print clf # you can inspect the learned model by printing it
print "Training time (secs): {:.3f}".format(time_taken) # time taken to train the classifier
In [12]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score
def predict_labels(clf, features, target):
    """
    Function: Given a trained classifier, perform classification
              on the samples in features and compute the F1 score.
    Params:
        clf: Instance of a trained scikit-learn classifier.
        features: Sample dataset to predict the target labels for,
                  {array-like, sparse matrix}, shape (n_samples, n_features).
        target: Actual target labels of the sample dataset,
                array-like, shape (n_samples,).
    Returns:
        end-start: Time taken (secs) to predict the target labels.
        f1_score: Metric used to measure the performance of the classifier.
    """
    #print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    #print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return end - start, f1_score(target.values, y_pred, pos_label=1)
In [13]:
time_taken,train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {:.3f}".format(train_f1_score)
print "Training set prediction time (secs): {:.3f}".format(time_taken)
In [14]:
# Predict on test data
time_taken,test_f1_score = predict_labels(clf, X_test, y_test)
print "F1 score for test set: {:.3f}".format(test_f1_score)
print "Test set precidtion time (secs): {:.3f}".format(time_taken)
In [15]:
# Train and predict
def train_predict(clf, X_train, y_train, X_test, y_test):
    """
    Function: Given the instance of a scikit-learn classifier and the
              training and test sets, return a dictionary containing
              the time and F1 score measures on both sets.
    Params:
        clf: Instance of a scikit-learn classifier.
        X_train: The training data to fit the classifier instance with,
                 {array-like, sparse matrix}, shape (n_samples, n_features).
        y_train: The target labels to train the classifier instance with,
                 array-like, shape (n_samples,).
        X_test: The test dataset to predict the target labels for,
                {array-like, sparse matrix}, shape (n_samples, n_features).
        y_test: The actual target labels of the test dataset,
                array-like, shape (n_samples,).
    Returns:
        Dict{
            'Training set size': size of the training set,
            'Training time (secs)': time taken to train the classifier,
            'F1 Score on Training Set': F1 score on the training samples,
            'Prediction time (secs)': time taken to predict on the test set,
            'F1 Score on Testing Set': F1 score on the test samples
        }
    """
    clf, train_time = train_classifier(clf, X_train, y_train)                        # Train the classifier
    predict_train_time, predict_train_score = predict_labels(clf, X_train, y_train)  # Predict on the training set
    predict_test_time, predict_test_score = predict_labels(clf, X_test, y_test)      # Predict on the test set
    return {"Training set size": len(X_train),
            "Training time (secs)": train_time,
            "F1 Score on Training Set": predict_train_score,
            "Prediction time (secs)": predict_test_time,
            "F1 Score on Testing Set": predict_test_score,
            }
# TODO: Run the helper function above for desired subsets of training data
In [16]:
#Retrieve the metrics using different training set sizes
#Note: Keep the test set constant
def get_metrics(clf, X_train, y_train, X_test, y_test):
    """
    Function: Computes and returns a DataFrame containing the performance measures
              of the classifier for training set sizes in [100, 200, 300].
    Params:
        clf: Instance of a scikit-learn classifier.
        X_train: The training data to fit the classifier instance with,
                 {array-like, sparse matrix}, shape (n_samples, n_features).
        y_train: The target labels to train the classifier instance with,
                 array-like, shape (n_samples,).
        X_test: The test dataset to predict the target labels for,
                {array-like, sparse matrix}, shape (n_samples, n_features).
        y_test: The actual target labels of the test dataset,
                array-like, shape (n_samples,).
    Returns:
        Pandas DataFrame containing the performance metrics of the
        classifier, indexed by the training set size.
    """
    perf = []
    for size in [100, 200, 300]:
        perf.append(train_predict(clf, X_train[:size], y_train[:size], X_test, y_test))
    return pd.DataFrame.from_records(perf, index="Training set size")
In [17]:
def classifier_perf(clist, X, y, cv=None):
    """
    Function: Train a list of classifiers and return
              a dictionary of performance metrics.
    Params:
        clist: List of scikit-learn classifier instances.
        X: The samples and features in the dataset,
           {array-like, sparse matrix}, shape (n_samples, n_features).
        y: The target labels in the dataset,
           array-like, shape (n_samples,).
        cv: Instance of a cross-validation generator.
    Returns:
        When cv=None:
            clfs: A dictionary keyed by classifier name, with the
                  performance metrics as a DataFrame for each classifier.
        When cv!=None:
            clfs: A dictionary keyed by classifier name, with the mean
                  performance metrics as a DataFrame for each classifier.
            devs: The standard deviations of the performance measures
                  across the cross-validation iterations.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=18)
    if not cv:
        clfs = {}
        for clf in clist:
            clfs[clf.__class__.__name__] = get_metrics(clf, X_train, y_train, X_test, y_test)
        return clfs
    else:
        clfs = defaultdict(list)
        devs = defaultdict(list)
        for clf in clist:
            for train, test in cv:
                X_train, y_train, X_test, y_test = X.iloc[train], y.iloc[train], X.iloc[test], y.iloc[test]
                clfs[clf.__class__.__name__].append(get_metrics(clf, X_train, y_train, X_test, y_test))
        for clf, perf in clfs.items():
            df = pd.DataFrame()
            for d in perf:
                df = df.append(d)  # stack the per-fold metric tables
            clfs[clf] = df.groupby(df.index).mean()
            devs[clf] = df.groupby(df.index).std()
        return clfs, devs
In [18]:
# Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
clist = [MultinomialNB(),LogisticRegression(),SVC()]
"""
Fit this model to the training data, try to predict labels
(for both training and test sets), and measure the F1 score.
Repeat this process with different training set sizes (100, 200, 300),
keeping test set constant.
"""
clfs = classifier_perf(clist,X_all,y_all)
In [19]:
%matplotlib inline
from matplotlib import pylab
def plot_learning_curve(clf, clf_name, devs=None):
    pylab.plot(clf.index, clf["F1 Score on Training Set"], label="F1 Score on Training Set")
    if devs:
        pylab.fill_between(clf.index, clf["F1 Score on Training Set"] - devs[clf_name]["F1 Score on Training Set"],
                           clf["F1 Score on Training Set"] + devs[clf_name]["F1 Score on Training Set"],
                           alpha=0.1, color="r")
    pylab.plot(clf.index, clf["F1 Score on Testing Set"], label="F1 Score on Testing Set")
    if devs:
        pylab.fill_between(clf.index, clf["F1 Score on Testing Set"] - devs[clf_name]["F1 Score on Testing Set"],
                           clf["F1 Score on Testing Set"] + devs[clf_name]["F1 Score on Testing Set"],
                           alpha=0.1, color="r")
    pylab.xlabel("Training Set Size")
    pylab.ylabel("F1 Score")
    pylab.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    pylab.grid()
    pylab.title("Learning Curve for the classifier: {}".format(clf_name))
    pylab.show()
What is the theoretical O(n) time & space complexity in terms of input size? What are the general applications of this model? What are its strengths and weaknesses? Given what you know about the data so far, why did you choose this model to apply?
The Naive Bayes classifier is a learning algorithm based on Bayes' theorem. The fundamental idea is to use Bayesian reasoning to identify the most likely of a set of hypotheses. It assumes an underlying probability distribution, updates that distribution using the outcomes observed in training, and is therefore a generative classifier. It is "naive" in that it assumes conditional independence of the features.
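Concretely, under the conditional-independence assumption the classifier picks the class c that maximizes the posterior (the standard decision rule, stated here for reference):
$$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$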
The values below are taken from A Comparative Study of Semi-naive Bayes Methods in Classification Learning.
Given n features and c classes, the space complexity of a Naive Bayes classifier is roughly O(n*c).
Given n features, k samples and c classes, the time complexity is roughly
for training - O(n*k + n*c)
for testing - O(n*k*c)
Due to their computational simplicity (roughly linear) and the independence assumption, Naive Bayes classifiers are widely used in applications with relatively high-dimensional features, such as text classification, facial recognition, market segmentation and medical diagnosis.
Given that the data at hand is high-dimensional (~30 features) and has comparatively few training samples (~300), a low-variance/high-bias classifier such as Naive Bayes should perform well, as it tends to avoid overfitting and is not strongly affected by the curse of dimensionality. Further, its simplicity and fast training make it a favorable choice given the limited computational resources. sklearn.naive_bayes.MultinomialNB is suitable here because it is designed for discrete features, and most of the features in this dataset are categorical and have been further discretized using dummy variables.
In [20]:
display("MultinimialNB")
display(clfs["MultinomialNB"])
print "For the entire training set the MultinomialNB classifier\
reports an F1 Score of {:.3f} on the testing set".format(clfs["MultinomialNB"]
["F1 Score on Testing Set"].loc[300])
plot_learning_curve(clfs["MultinomialNB"],"MultinomialNB")
Support Vector Machine (SVM) is a supervised machine learning algorithm that is used for both regression and classification problems. The classifier is best described as a large-margin classifier. The intuition is to plot the samples in an n-dimensional feature space and find a hyperplane that linearly separates the classes by maximizing the margin (the distance between the hyperplane and the nearest points, a.k.a. the support vectors). Since the basic formulation works only for linearly separable classes, non-linear decision boundaries are found using ingenious mathematical transformations known as "kernels". This is often referred to as the "kernel trick".
A number of factors affect the complexity of the support vector machine algorithm, including the underlying metric space used to identify the margin, the kernel used and the type of SVM optimizer used. However, the following is a ballpark reference for non-linear SVMs from the Journal of Machine Learning Research, Vol. 6.
Given m samples in the training set, the space complexity is roughly O(m^2).
Given m samples in the training set, the time complexity is roughly O(m^3).
Since SVMs are used where complex non-linear hypotheses must be learnt, they are applied in a variety of fields such as computer vision, natural language processing, neuroimaging and bioinformatics.
Since the dataset is high-dimensional and the training set is relatively small, SVM seemed like an appropriate choice of learning algorithm. With an RBF/Gaussian kernel it is possible to learn non-linear hypotheses and perform relatively well on the classification task. Furthermore, we will be able to tune the margin of the decision boundary using the regularization parameter C. sklearn.svm.SVC is implemented on top of LIBSVM and is quite efficient at the optimization. With the above considerations, SVM was chosen as one of the algorithms suitable for the problem given the data and resources.
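As a minimal sketch of this choice (the hyperparameter values here are scikit-learn defaults, not tuned):
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0)      # RBF kernel for non-linear boundaries; C controls margin softness
svm.fit(X_train, y_train)
print "Test accuracy: {:.3f}".format(svm.score(X_test, y_test))   # quick sanity check only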
In [21]:
display("SVM")
display(clfs["SVC"])
print "For the entire training set the Support Vector Machine classifier\
reports an F1 Score of {:.3f} on the testing set".format(clfs["SVC"]
["F1 Score on Testing Set"].loc[300])
plot_learning_curve(clfs["SVC"],"SVC")
Logistic regression, despite its name, is a statistical model that analyzes the relationship between multiple independent variables and a categorical dependent variable, estimating the probability of the output class by fitting the data to a logistic curve. Although the method is designed for binary classification, multiclass classification can be performed using a technique called one-vs-all, where multiple models are trained and the class with the highest probability is chosen as the output. Since the model only outputs the probability of a class, we must provide a threshold value that determines the output class.
The complexity of the model is essentially governed by the optimizer used to minimize the cost function. We shall consider the generalized linear optimizer (conjugate gradient) here; ref. Logistic Regression for Data Mining and High-Dimensional Classification.
The space complexity of logistic regression is O(1) in the number of training samples.
Given m samples in the training set, the time complexity is roughly (less than) O(m).
Logistic regression is suited to a wide range of classification tasks as long as the classes do not overlap heavily. Applications include stock market analysis, market segmentation, optical character recognition and digit recognition.
Logistic regression is a very simple learning algorithm, yet it is able to estimate complex hypotheses. The ability to do this with very little computational resource and still arrive at reasonable hypotheses is a promising strength that suits the learning problem at hand. Controlling the regularization parameter allows us to tune the complexity of the hypothesis. sklearn.linear_model.LogisticRegression is implemented quite efficiently, and with the above considerations logistic regression is a suitable algorithm for the given problem and data, as it can prove cost-effective while making reasonable predictions.
In [22]:
display("LogisticRegression")
display(clfs["LogisticRegression"])
print "For the entire training set the LogisticRegression classifier\
reports an F1 Score of {:.3f} on the testing set".format(clfs["LogisticRegression"]
["F1 Score on Testing Set"].loc[300])
plot_learning_curve(clfs["LogisticRegression"],"LogisticRegression")
In [23]:
"""
Produce a table showing training time, prediction time,
F1 score on training set and F1 score on test set, for each training set size.
Note: You need to produce 3 such tables - one for each model.
"""
from IPython.display import display, HTML
for k,v in clfs.iteritems():
display(k)
display(v)
The above results are based on only one train/test split, and it is unclear whether the particular distribution of the training data played a role in the classifiers' F1 scores. Hence, the experiments are repeated with StratifiedShuffleSplit cross-validation, and the mean scores and times across the splits are taken as a more robust measure of the classifiers' performance.
In [24]:
from sklearn.cross_validation import StratifiedShuffleSplit
cv = StratifiedShuffleSplit(y_all, n_iter=30, test_size=0.24, random_state=42)
cvlist = [MultinomialNB(),LogisticRegression(),SVC()]
clfs,devs = classifier_perf(cvlist,X_all,y_all,cv=cv)
for k,v in clfs.iteritems():
display(k)
display(v)
plot_learning_curve(v,k,devs)
In [25]:
%matplotlib inline
from matplotlib import pylab
def plotClfCompare(clfs, devs=None, **kwargs):
    # Map keyword switches to the metric columns produced by train_predict
    wmap = {"test_score": "F1 Score on Testing Set",
            "train_score": "F1 Score on Training Set",
            "predict_time": "Prediction time (secs)",
            "train_time": "Training time (secs)"}
    for metric in [wmap[key] for key, value in kwargs.items() if value]:
        for key, value in clfs.items():
            pylab.plot(value.index, value[metric], label=key)
            if devs:
                pylab.fill_between(value.index, value[metric] - devs[key][metric],
                                   value[metric] + devs[key][metric], alpha=0.1, color="r")
        pylab.xlabel(value.index.name)
        pylab.ylabel(value[metric].name)
        pylab.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        pylab.grid()
        pylab.title("".join([value.index.name, " vs ", value[metric].name]))
        pylab.show()
In [26]:
cv = StratifiedShuffleSplit(y_all, n_iter=30, test_size=0.24, random_state=42)
cvlist = [MultinomialNB(),LogisticRegression(),SVC()]
clfs,devs = classifier_perf(cvlist,X_all,y_all,cv=cv)
plotClfCompare(clfs,devs,test_score =True,train_score=True,
predict_time=True,train_time=True)
The model is evaluated on three factors:
Its F1 score, the harmonic mean of precision and recall (see the sketch after this list). In other words, how well does the model differentiate likely passes from failures?
The size of the training set, preferring smaller training sets over larger ones. That is, how much data does the model need to make a reasonable prediction?
The computation resources to make a reliable prediction. How much time and memory is required to correctly identify students that need intervention?
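For reference, F1 is the harmonic mean of precision and recall; a quick sketch with illustrative (hypothetical) numbers:
precision, recall = 0.80, 0.75                       # hypothetical values
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print "F1 = {:.3f}".format(f1)                       # prints: F1 = 0.774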
While all of this points to SVM as a suitable model, its cost is a concern; the logistic regression model is believed to be able to perform comparably at a lower cost, provided we can find the right hyperparameters. Hence we shall proceed with the LogisticRegression model.
As discussed earlier, logistic regression is a linear statistical model. Unlike regression, which outputs a continuous value, this model doesn't predict a numerical value; instead, its output is the probability that the given input belongs to a certain class. It operates under the assumption that the inputs, if plotted in a feature space, can be separated into two "regions", one for each class, by a linear (straight) boundary, a.k.a. the decision boundary.
Concretely, logistic regression finds a hypothesis function which outputs a probability in [0, 1] that the given input belongs to a certain class. The probability can be intuitively derived from the distance of the input point from the decision boundary: the greater the distance between the decision boundary and the given point, the greater the probability of the point belonging to a certain class. However, this distance can range over [0, ∞), so it is squashed into a probability in [0, 1] using a scaling function. One such scaling function is the sigmoid (logistic) function shown below.
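For reference, the sigmoid function maps any real-valued signed distance z to a probability:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
so points far on one side of the boundary (large positive z) get probabilities near 1, and points far on the other side get probabilities near 0.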
Using the above curve, the logistic model predicts a value of 1 for any point to the right of the decision boundary and 0 for any point to its left. Here, 1 and 0 are the classes that the model assigns to a given data point; for instance, 1 could represent the red points and 0 the blue points.
The learning algorithm uses the training dataset to find the hypothesis function that best approximates the relationship between the input data and the output class. The right hypothesis function and decision boundary are learned from the probability that a training point is classified correctly by the model. Every wrongly classified point incurs a cost/error that the learning algorithm aims to reduce. The average probability over the entire training dataset gives the likelihood that a random data point is classified correctly by the hypothesis. The learning objective is to find a hypothesis function/decision boundary that increases this likelihood of classifying a randomly chosen point correctly and reduces the classification error.
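Concretely, maximizing this likelihood is equivalent to minimizing the average cross-entropy cost over the m training points (standard form, stated here for reference), where $h_w(x) = \sigma(w \cdot x + b)$:
$$J(w) = -\frac{1}{m} \sum_{j=1}^{m} \left[ y^{(j)} \log h_w(x^{(j)}) + \left(1 - y^{(j)}\right) \log\left(1 - h_w(x^{(j)})\right) \right]$$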
Since the model's output is probabilistic, we must define a probability threshold for the model to make an actual prediction. For the student intervention system, 0.50 seems appropriate: if there is even a 0.50 chance that a student might not pass, it is justifiable to provide the student with the support needed to graduate.
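As a minimal sketch of how this threshold would be applied (assuming the fitted LogisticRegression from above, and that clf.classes_ is [0, 1]):
proba_pass = clf.predict_proba(X_test)[:, 1]   # column 1 = P(passed == 1), given classes_ == [0, 1]
needs_intervention = proba_pass < 0.50         # flag students at risk of not passing
print "Students flagged for intervention: {}".format(needs_intervention.sum())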
In logistic regression, a hyperparameter called the regularization parameter (C) can be used to control the complexity of the model. The complexity affects the bias/variance of the model, so we shall use GridSearchCV to find the value of C that strikes the right tradeoff between bias and variance. The search will also cover the l1 and l2 penalties, which differ in how they handle sparse and correlated predictors.
In [27]:
params = {"C":np.linspace(0.001,1,200),"penalty":["l1","l2"],"solver":["liblinear"]}
clf = LogisticRegression(random_state=42)
cv = StratifiedShuffleSplit(y_train,n_iter=25,test_size=0.20)
from sklearn.grid_search import GridSearchCV
In [28]:
gclf = GridSearchCV(clf,param_grid=params,scoring="f1",cv=cv,n_jobs=4)
gclf.fit(X_train,y_train)
In [29]:
print gclf.best_params_
In [30]:
y_pred = gclf.predict(X_test)
In [31]:
print "F1 Score: {:.3f}".format(f1_score(y_test,y_pred))